Springboard Apps project - Tier 3 - Complete

Welcome to the Apps project! To give you a taste of your future career, we're going to walk through exactly the kind of notebook that you'd write as a data scientist. In the process, we'll be sure to signpost the general framework for our investigation - the Data Science Pipeline - as well as give reasons for why we're doing what we're doing. We're also going to apply some of the skills and knowledge you've built up in the previous unit when reading Professor Spiegelhalter's The Art of Statistics (hereinafter AoS).

So let's get cracking!

Brief

Did Apple Store apps receive better reviews than Google Play apps?

Stages of the project

  1. Sourcing and loading
    • Load the two datasets
    • Pick the columns that we are going to work with
    • Subset the data on this basis
  2. Cleaning, transforming and visualizing
    • Check the data types and fix them
    • Add a platform column to both the Apple and the Google dataframes
    • Change the column names to prepare for a join
    • Join the two datasets
    • Eliminate the NaN values
    • Filter only those apps that have been reviewed at least once
    • Summarize the data visually and analytically (by the column platform)
  3. Modelling
    • Hypothesis formulation
    • Getting the distribution of the data
    • Permutation test
  4. Evaluating and concluding
    • What is our conclusion?
    • What is our decision?
    • Other models we could have used

Importing the libraries

In this case, we are going to import pandas, numpy, scipy, random and matplotlib.pyplot.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# scipy is a library for scientific computing, including statistical tests
from scipy import stats
# random enables us to generate random numbers
import random

Stage 1 - Sourcing and loading data

1a. Source and load the data

Let's download the data from Kaggle. Kaggle is a fantastic resource: a kind of social network for data scientists, it boasts projects, datasets and news on the freshest libraries and technologies all in one place. The data from the Apple Store can be found here and the data from the Google Play Store can be found here. Download the datasets and save them in your working directory.

In [2]:
# Now that the files are saved, we want to load them into Python using read_csv and pandas.

# Create a variable called google, and store in it the path of the csv file that contains your google dataset. 
# If your dataset is in the same folder as this notebook, the path will simply be the name of the file. 
google = 'googleplaystore.csv'

# Read the csv file into a data frame called Google using the read_csv() pandas method.
Google = pd.read_csv(google)

# Using the head() pandas method, observe the first three entries.
Google.head(3)
Out[2]:
App Category Rating Reviews Size Installs Type Price Content Rating Genres Last Updated Current Ver Android Ver
0 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1 159 19M 10,000+ Free 0 Everyone Art & Design January 7, 2018 1.0.0 4.0.3 and up
1 Coloring book moana ART_AND_DESIGN 3.9 967 14M 500,000+ Free 0 Everyone Art & Design;Pretend Play January 15, 2018 2.0.0 4.0.3 and up
2 U Launcher Lite – FREE Live Cool Themes, Hide ... ART_AND_DESIGN 4.7 87510 8.7M 5,000,000+ Free 0 Everyone Art & Design August 1, 2018 1.2.4 4.0.3 and up
In [3]:
# Create a variable called apple, and store in it the path of the csv file that contains your apple dataset. 
apple = 'AppleStore.csv'

# Read the csv file into a pandas DataFrame object called Apple.
Apple = pd.read_csv(apple)

# Observe the first three entries like you did with your other data. 
Apple.head(3)
Out[3]:
Unnamed: 0 id track_name size_bytes currency price rating_count_tot rating_count_ver user_rating user_rating_ver ver cont_rating prime_genre sup_devices.num ipadSc_urls.num lang.num vpp_lic
0 1 281656475 PAC-MAN Premium 100788224 USD 3.99 21292 26 4.0 4.5 6.3.5 4+ Games 38 5 10 1
1 2 281796108 Evernote - stay organized 158578688 USD 0.00 161065 26 4.0 3.5 8.2.2 4+ Productivity 37 5 23 1
2 3 281940292 WeatherBug - Local Weather, Radar, Maps, Alerts 100524032 USD 0.00 188583 2822 3.5 4.5 5.0.0 4+ Weather 37 5 3 1

1b. Pick the columns we'll work with

From the documentation of these datasets, we can infer that the most appropriate columns to answer the brief are:

  1. Google:
    • Category # Do we need this?
    • Rating
    • Reviews
    • Price (maybe)
  2. Apple:
    • prime_genre # Do we need this?
    • user_rating
    • rating_count_tot
    • price (maybe)

1c. Subsetting accordingly

Let's select only those columns that we want to work with from both datasets, and overwrite the original variables with these subsets.

In [4]:
# Subset our DataFrame object Google by selecting just the variables ['Category', 'Rating', 'Reviews', 'Price']
Google = Google[['Category', 'Rating', 'Reviews', 'Price']]

# Check the first three entries
Google.head(3)
Out[4]:
Category Rating Reviews Price
0 ART_AND_DESIGN 4.1 159 0
1 ART_AND_DESIGN 3.9 967 0
2 ART_AND_DESIGN 4.7 87510 0
In [5]:
# Do the same with our Apple object, selecting just the variables ['prime_genre', 'user_rating', 'rating_count_tot', 'price']
Apple = Apple[['prime_genre', 'user_rating', 'rating_count_tot', 'price']]

# Let's check the first three entries
Apple.head(3)
Out[5]:
prime_genre user_rating rating_count_tot price
0 Games 4.0 21292 3.99
1 Productivity 4.0 161065 0.00
2 Weather 3.5 188583 0.00

Stage 2 - Cleaning, transforming and visualizing

2a. Check the data types for both Apple and Google, and fix them

Types are crucial for data science in Python. Let's determine whether the variables we selected in the previous section have the types they should, or whether there are any errors here.

In [6]:
# Using the dtypes feature of pandas DataFrame objects, check out the data types within our Apple dataframe.
# Are they what you expect?
Apple.dtypes
Out[6]:
prime_genre          object
user_rating         float64
rating_count_tot      int64
price               float64
dtype: object

This is looking healthy. But what about our Google data frame?

In [7]:
# Using the same dtypes feature, check out the data types of our Google dataframe. 
Google.dtypes
Out[7]:
Category     object
Rating      float64
Reviews      object
Price        object
dtype: object

Weird. The data type for the column 'Price' is 'object', not a numeric data type like a float or an integer. Let's investigate the unique values of this column.

In [8]:
# Use the unique() pandas method on the Price column to check its unique values. 
Google.Price.unique()
Out[8]:
array(['0', '$4.99', '$3.99', '$6.99', '$1.49', '$2.99', '$7.99', '$5.99',
       '$3.49', '$1.99', '$9.99', '$7.49', '$0.99', '$9.00', '$5.49',
       '$10.00', '$24.99', '$11.99', '$79.99', '$16.99', '$14.99',
       '$1.00', '$29.99', '$12.99', '$2.49', '$10.99', '$1.50', '$19.99',
       '$15.99', '$33.99', '$74.99', '$39.99', '$3.95', '$4.49', '$1.70',
       '$8.99', '$2.00', '$3.88', '$25.99', '$399.99', '$17.99',
       '$400.00', '$3.02', '$1.76', '$4.84', '$4.77', '$1.61', '$2.50',
       '$1.59', '$6.49', '$1.29', '$5.00', '$13.99', '$299.99', '$379.99',
       '$37.99', '$18.99', '$389.99', '$19.90', '$8.49', '$1.75',
       '$14.00', '$4.85', '$46.99', '$109.99', '$154.99', '$3.08',
       '$2.59', '$4.80', '$1.96', '$19.40', '$3.90', '$4.59', '$15.46',
       '$3.04', '$4.29', '$2.60', '$3.28', '$4.60', '$28.99', '$2.95',
       '$2.90', '$1.97', '$200.00', '$89.99', '$2.56', '$30.99', '$3.61',
       '$394.99', '$1.26', 'Everyone', '$1.20', '$1.04'], dtype=object)

Aha! Fascinating. There are actually two issues here.

  • Firstly, there's a price called Everyone. That is a massive mistake!
  • Secondly, there are dollar symbols everywhere!

Let's address the first issue first by checking the data points that have the price value 'Everyone'.

In [9]:
# Let's check which data points have the value 'Everyone' for the 'Price' column by subsetting our Google dataframe.

# Subset the Google dataframe on the price column. 
# To be sure: you want to pick out just those rows whose value for the 'Price' column is just 'Everyone'. 
Google[Google.Price == 'Everyone']
Out[9]:
Category Rating Reviews Price
10472 1.9 19.0 3.0M Everyone

Thankfully, it's just one row. We've gotta get rid of it.

In [10]:
# Let's eliminate that row. 

# Subset our Google dataframe to pick out just those rows whose value for the 'Price' column is NOT 'Everyone'. 
# Reassign that subset to the Google variable. 
# You can do this in two lines or one. Your choice! 
Google = Google[Google.Price != 'Everyone']

# Check the unique values of the 'Price' column again
Google.Price.unique()
Out[10]:
array(['0', '$4.99', '$3.99', '$6.99', '$1.49', '$2.99', '$7.99', '$5.99',
       '$3.49', '$1.99', '$9.99', '$7.49', '$0.99', '$9.00', '$5.49',
       '$10.00', '$24.99', '$11.99', '$79.99', '$16.99', '$14.99',
       '$1.00', '$29.99', '$12.99', '$2.49', '$10.99', '$1.50', '$19.99',
       '$15.99', '$33.99', '$74.99', '$39.99', '$3.95', '$4.49', '$1.70',
       '$8.99', '$2.00', '$3.88', '$25.99', '$399.99', '$17.99',
       '$400.00', '$3.02', '$1.76', '$4.84', '$4.77', '$1.61', '$2.50',
       '$1.59', '$6.49', '$1.29', '$5.00', '$13.99', '$299.99', '$379.99',
       '$37.99', '$18.99', '$389.99', '$19.90', '$8.49', '$1.75',
       '$14.00', '$4.85', '$46.99', '$109.99', '$154.99', '$3.08',
       '$2.59', '$4.80', '$1.96', '$19.40', '$3.90', '$4.59', '$15.46',
       '$3.04', '$4.29', '$2.60', '$3.28', '$4.60', '$28.99', '$2.95',
       '$2.90', '$1.97', '$200.00', '$89.99', '$2.56', '$30.99', '$3.61',
       '$394.99', '$1.26', '$1.20', '$1.04'], dtype=object)

Our second problem remains: I'm seeing dollar symbols when I close my eyes! (And not in a good way).

This is a problem because Python actually considers these values strings. So we can't do mathematical and statistical operations on them until we've made them into numbers.

In [11]:
# Let's create a variable called nosymb.
# This variable will take the Price column of Google and apply the str.replace() method. 
nosymb = Google.Price.str.replace('$', '', regex=False)

# Now we need to do two things:
# i. Make the values in the nosymb variable numeric using the to_numeric() pandas method.
# ii. Assign this new set of numeric, dollar-sign-less values to Google['Price']. 
# You can do this in one line if you wish.
Google.Price = pd.to_numeric(nosymb)
In [12]:
nosymb.unique()
Out[12]:
array(['0', '4.99', '3.99', '6.99', '1.49', '2.99', '7.99', '5.99',
       '3.49', '1.99', '9.99', '7.49', '0.99', '9.00', '5.49', '10.00',
       '24.99', '11.99', '79.99', '16.99', '14.99', '1.00', '29.99',
       '12.99', '2.49', '10.99', '1.50', '19.99', '15.99', '33.99',
       '74.99', '39.99', '3.95', '4.49', '1.70', '8.99', '2.00', '3.88',
       '25.99', '399.99', '17.99', '400.00', '3.02', '1.76', '4.84',
       '4.77', '1.61', '2.50', '1.59', '6.49', '1.29', '5.00', '13.99',
       '299.99', '379.99', '37.99', '18.99', '389.99', '19.90', '8.49',
       '1.75', '14.00', '4.85', '46.99', '109.99', '154.99', '3.08',
       '2.59', '4.80', '1.96', '19.40', '3.90', '4.59', '15.46', '3.04',
       '4.29', '2.60', '3.28', '4.60', '28.99', '2.95', '2.90', '1.97',
       '200.00', '89.99', '2.56', '30.99', '3.61', '394.99', '1.26',
       '1.20', '1.04'], dtype=object)

Now let's look at the unique values of the 'Price' column again, to verify that they really are numeric now.

In [13]:
# Check the unique values of the 'Price' column again; they should now be floats rather than strings.
Google.Price.unique()
Out[13]:
array([  0.  ,   4.99,   3.99,   6.99,   1.49,   2.99,   7.99,   5.99,
         3.49,   1.99,   9.99,   7.49,   0.99,   9.  ,   5.49,  10.  ,
        24.99,  11.99,  79.99,  16.99,  14.99,   1.  ,  29.99,  12.99,
         2.49,  10.99,   1.5 ,  19.99,  15.99,  33.99,  74.99,  39.99,
         3.95,   4.49,   1.7 ,   8.99,   2.  ,   3.88,  25.99, 399.99,
        17.99, 400.  ,   3.02,   1.76,   4.84,   4.77,   1.61,   2.5 ,
         1.59,   6.49,   1.29,   5.  ,  13.99, 299.99, 379.99,  37.99,
        18.99, 389.99,  19.9 ,   8.49,   1.75,  14.  ,   4.85,  46.99,
       109.99, 154.99,   3.08,   2.59,   4.8 ,   1.96,  19.4 ,   3.9 ,
         4.59,  15.46,   3.04,   4.29,   2.6 ,   3.28,   4.6 ,  28.99,
         2.95,   2.9 ,   1.97, 200.  ,  89.99,   2.56,  30.99,   3.61,
       394.99,   1.26,   1.2 ,   1.04])

Notice that the column Reviews is still an object column. We actually need this column to be a numeric column, too.

In [14]:
# Convert the 'Reviews' column to a numeric data type. 
Google['Reviews'] = pd.to_numeric(Google.Reviews)
In [15]:
# Let's check the data types of Google again
Google.dtypes
Out[15]:
Category     object
Rating      float64
Reviews       int64
Price       float64
dtype: object

2b. Add a platform column to both the Apple and the Google dataframes

Let's add a new column to both dataframe objects called platform: all of its values in the Google dataframe will be just 'google', and all of its values for the Apple dataframe will be just 'apple'.

The reason we're making this column is so that we can ultimately join our Apple and Google data together, and actually test out some hypotheses to solve the problem in our brief.

In [16]:
# Create a column called 'platform' in both the Apple and Google dataframes. 
# Add the value 'Apple' and the value 'Google' as appropriate. 
Apple['platform'] = 'Apple'
Google['platform'] = 'Google'

2c. Change the column names to prepare for our join of the two datasets

The easiest way to join two datasets is when they have both:

  • the same number of columns
  • the same column names

So we need to rename the columns of Apple so that they're the same as the ones of Google, or vice versa.

In this case, we're going to change the Apple column names to the names of the Google columns.

This is an important step to unify the two datasets!

In [17]:
# Create a variable called old_names where you'll store the column names of the Apple dataframe. 
# Use the feature .columns.
old_names = Apple.columns

# Create a variable called new_names where you'll store the column names of the Google dataframe. 
new_names = Google.columns

# Use the rename() DataFrame method to change the columns names. 
Apple = Apple.rename(columns = dict(zip(old_names,new_names)))
In [18]:
Apple
Out[18]:
Category Rating Reviews Price platform
0 Games 4.0 21292 3.99 Apple
1 Productivity 4.0 161065 0.00 Apple
2 Weather 3.5 188583 0.00 Apple
3 Shopping 4.0 262241 0.00 Apple
4 Reference 4.5 985920 0.00 Apple
... ... ... ... ... ...
7192 Games 4.5 142 0.00 Apple
7193 Games 4.5 30 0.00 Apple
7194 Utilities 4.5 15 1.99 Apple
7195 Games 4.5 85 0.00 Apple
7196 Games 5.0 3 0.00 Apple

7197 rows × 5 columns

2d. Join the two datasets

Let's combine the two datasets into a single dataframe. We'll keep using the name Google for the combined data.

In [19]:
# Let's combine the two dataframes by concatenating Apple onto Google using pd.concat(). 
Google = pd.concat([Google, Apple])

# Using the sample() method with the number 12 passed to it, check 12 random points of your dataset.
Google.sample(12)
Out[19]:
Category Rating Reviews Price platform
3665 VIDEO_PLAYERS 4.3 25655305 0.0 Google
2534 Social Networking 4.0 427 0.0 Apple
1828 GAME 4.5 73539 0.0 Google
5787 FINANCE 4.6 4546 0.0 Google
4364 TOOLS 4.2 38 0.0 Google
6864 BUSINESS NaN 10 0.0 Google
9844 NEWS_AND_MAGAZINES 3.9 878065 0.0 Google
5472 FAMILY 4.6 87 0.0 Google
1354 Entertainment 2.5 39436 0.0 Apple
7017 Social Networking 4.0 3060 0.0 Apple
1708 Education 4.5 18749 0.0 Apple
6453 FINANCE 4.2 798 0.0 Google

2e. Eliminate the NaN values

As you can see, there are some NaN values. We want to eliminate all these NaN values from the table.

In [20]:
# Let's first check the dimensions of the combined dataframe before dropping `NaN` values. Use the .shape feature. 
print(Google.shape)

# Use the dropna() method to eliminate all the NaN values, and overwrite the same dataframe with the result. 
Google.dropna(inplace=True)

# Check the new dimensions of our dataframe.
print(Google.shape)
(18037, 5)
(16563, 5)

2f. Filter the data so that we only see those apps that have been reviewed at least once

Apps that haven't been reviewed yet can't help us solve our brief.

So let's check to see if any apps have no reviews at all.

In [21]:
# Subset your df to pick out just those rows whose value for 'Reviews' is equal to 0. 
# Do a count() on the result. 
df = Google[Google.Reviews == 0]
df.count()
Out[21]:
Category    929
Rating      929
Reviews     929
Price       929
platform    929
dtype: int64

929 apps have no reviews at all, so we need to eliminate these data points!

In [22]:
# Eliminate the points that have 0 reviews.
Google = Google[Google.Reviews != 0]

2g. Summarize the data visually and analytically (by the column platform)

What we need to solve our brief is a summary of the Rating column, but separated by the different platforms.

In [23]:
# To summarize analytically, let's use the groupby() method on our df, selecting just the 'Rating' and 'Price' columns.
Google.groupby('platform')[['Rating', 'Price']].mean()
Out[23]:
Rating Price
platform
Apple 4.049697 1.821977
Google 4.191757 0.960928

Interesting! Our means of 4.049697 and 4.191757 don't seem all that different! Perhaps we've solved our brief already: maybe there's no significant difference between Google Play app reviews and Apple Store app reviews. We have an observed difference here, which is simply (4.191757 - 4.049697) = 0.14206. This is just the actual difference that we observed between the mean rating for apps from Google Play and the mean rating for apps from the Apple Store. Let's look at how we're going to use this observed difference to solve our problem with a statistical test.
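
If you want to compute that observed difference directly from the dataframe, here's a minimal sketch using the same groupby approach (the names mean_ratings and obs_diff are just for illustration):

# Quick, illustrative check of the observed difference between the platform mean ratings
mean_ratings = Google.groupby('platform')['Rating'].mean()
obs_diff = abs(mean_ratings['Google'] - mean_ratings['Apple'])
print(obs_diff)  # roughly 0.142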

Outline of our method:

  1. We'll assume that platform (i.e., whether the app was Google or Apple) really doesn't impact on ratings.
  2. Given this assumption, we should actually be able to get a difference in mean rating for Apple apps and mean rating for Google apps that's pretty similar to the one we actually got (0.14206) just by: a. shuffling the ratings column, b. keeping the platform column the same, c. calculating the difference between the mean rating for Apple and the mean rating for Google. (There's a toy sketch of a single shuffle below.)
  3. We can make the shuffle more useful by doing it many times, each time calculating the mean rating for Apple apps, the mean rating for Google apps, and the difference between these means.
  4. We can then take the mean of all these differences, and call it our permutation difference. This permutation difference will be a great indicator of what the difference would be if our initial assumption were true and platform really didn't impact on ratings.
  5. Now we do a comparison. If the observed difference looks just like the permutation difference, then we stick with the claim that platform doesn't impact on ratings. If, however, the permutation difference differs significantly from the observed difference, we'll conclude: something's going on; the platform does in fact impact on ratings.
  6. As for what significantly means, we'll get to that. But that's a brief summary of what we're going to do. Exciting!

If you want to look more deeply at the statistics behind this project, check out this resource.
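
Before we implement this on the real data below, here's a toy sketch of a single shuffle (steps a-c of item 2), using made-up numbers rather than our app data:

# Toy illustration of one permutation: shuffle the ratings, keep the platform labels fixed
toy_ratings = np.array([4.5, 3.0, 4.0, 5.0, 2.5, 4.5])
toy_platforms = np.array(['apple', 'apple', 'apple', 'google', 'google', 'google'])
shuffled = np.random.permutation(toy_ratings)
toy_diff = shuffled[toy_platforms == 'apple'].mean() - shuffled[toy_platforms == 'google'].mean()
print(toy_diff)  # one permutation difference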

Let's also get a visual summary of the Rating column, separated by the different platforms.

A good tool to use here is the boxplot!

In [24]:
# Call the boxplot() method on our df.
Google.boxplot(by='platform',column ='Rating')
plt.tight_layout();

Here we see the same information as in the analytical summary, but with a boxplot. Can you see how the boxplot is working here? If you need to revise your boxplots, check out this link.

Stage 3 - Modelling

3a. Hypothesis formulation

Our Null hypothesis is just:

Hnull: the observed difference in the mean rating of Apple Store and Google Play apps is due to chance (and thus not due to the platform).

The more interesting hypothesis is called the Alternate hypothesis:

Halternative: the observed difference in the average ratings of Apple Store and Google Play apps is not due to chance (and is actually due to the platform).

We're also going to pick a significance level of 0.05.

3b. Getting the distribution of the data

Now that the hypotheses and significance level are defined, we can select a statistical test to determine whether to reject the null hypothesis.

There are many different statistical tests, all with different assumptions. You'll develop excellent judgement about when to use which statistical test over the course of the Data Science Career Track. But in general, one of the most important things to determine is the distribution of the data.

In [25]:
# Create a subset of the column 'Rating' by the different platforms.
# Call the subsets 'apple' and 'google'
apple = Google[Google.platform == 'Apple']['Rating']
google = Google[Google.platform == 'Google']['Rating']
In [26]:
# Using the stats.normaltest() method, get an indication of whether the apple data are normally distributed
# Save the result in a variable called apple_normal, and print it out
apple_normal = stats.normaltest(apple)
apple_normal
Out[26]:
NormaltestResult(statistic=3678.6157187516856, pvalue=0.0)
In [27]:
# Do the same with the google data. 
google_normal = stats.normaltest(google)
google_normal
Out[27]:
NormaltestResult(statistic=1778.9974234584017, pvalue=0.0)

Since the null hypothesis of normaltest() is that the data are normally distributed, the lower the p-value, the stronger the evidence that the data are not normally distributed.

Since the reported p-value is 0 for both tests, regardless of which significance level we pick, we conclude that the data are not normally distributed.

We can also check out the distribution of the data visually with a histogram. A normal distribution has the following visual characteristics:

- symmetric
- unimodal (one hump)
- roughly identical mean, median and mode
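
If you want to check that last characteristic numerically as well, here's a minimal sketch using the apple and google Series defined above (the loop is just for illustration):

# Compare mean, median and mode of the ratings for each platform
for name, ratings in [('apple', apple), ('google', google)]:
    print(name, ratings.mean(), ratings.median(), ratings.mode()[0])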

In [28]:
# Create a histogram of the apple reviews distribution
plt.hist(apple);
In [29]:
# Create a histogram of the google data
plt.hist(google);

3c. Permutation test (non-parametric test)

Since the data aren't normally distributed, we're using a non-parametric test here. This is simply a label for statistical tests that don't assume a specific distribution for the data, which makes them applicable in situations like ours precisely because they require so few assumptions.

Check out more about permutations here.

In [30]:
# Create a column called `Permutation1`, and assign to it the result of permuting (shuffling) the Rating column
# This assignment will use our numpy object's random.permutation() method
Google['Permutation1'] = np.random.permutation(Google.Rating)

# Call the describe() method on our permutation grouped by 'platform'. 
Google.groupby('platform')[['Permutation1']].describe()
Out[30]:
Permutation1
count mean std min 25% 50% 75% max
platform
Apple 6268.0 4.143618 0.606230 1.0 4.0 4.3 4.5 5.0
Google 9366.0 4.128902 0.617373 1.0 4.0 4.3 4.5 5.0
In [31]:
# Let's compare with the previous analytical summary:
Google.groupby('platform')[['Rating','Permutation1']].describe().T
Out[31]:
platform Apple Google
Rating count 6268.000000 9366.000000
mean 4.049697 4.191757
std 0.726943 0.515219
min 1.000000 1.000000
25% 4.000000 4.000000
50% 4.500000 4.300000
75% 4.500000 4.500000
max 5.000000 5.000000
Permutation1 count 6268.000000 9366.000000
mean 4.143618 4.128902
std 0.606230 0.617373
min 1.000000 1.000000
25% 4.000000 4.000000
50% 4.300000 4.300000
75% 4.500000 4.500000
max 5.000000 5.000000
In [32]:
Google.groupby('platform')[['Permutation1']].mean()
Out[32]:
Permutation1
platform
Apple 4.143618
Google 4.128902
In [33]:
np.random.seed(42)

# The difference in the means for Permutation1 (roughly 0.015) now looks hugely different from our observed difference of 0.14206. 
# It's sure starting to look like our observed difference is significant, and that the Null is false: platform does impact on ratings.
# But to be sure, let's create 10,000 permutations and, for each one, calculate the mean ratings for Google and Apple apps and the difference between these means.
# Let's store those differences in a vector - that will be the distribution of the Null.

# First, make an empty NumPy array called difference with room for 10,000 values.
difference = np.empty(10000)

# Now make a for loop that does the following 10,000 times:
# 1. makes a permutation of the 'Rating' as you did above
# 2. calculates the difference in the mean rating for apple and the mean rating for google. 
for i in range(10000):
    Google['Permutation1'] = np.random.permutation(Google.Rating)
    means = Google.groupby('platform')['Permutation1'].mean()
    difference[i] = means.loc['Apple'] - means.loc['Google']
In [35]:
# Make a variable called 'histo', and assign to it the result of plotting a histogram of the difference list. 
histo = plt.hist(difference)
In [38]:
# Now make a variable called obs_difference, and assign it the difference between the mean of our 'apple' variable and the mean of our 'google' variable.
obs_difference =  np.mean(apple) - np.mean(google)

# Make this difference absolute with the built-in abs() function. 
obs_difference = abs(obs_difference)

# Print out this value; it should be 0.1420605474512291. 
print(obs_difference)
0.1420605474512291
In [34]:
# Another way, following the DataCamp approach (section 11.3)

np.random.seed(42)
def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets."""

    # Concatenate the data sets: data
    data = np.concatenate((data1, data2))

    # Permute the concatenated array: permuted_data
    permuted_data = np.random.permutation(data)

    # Split the permuted array into two: perm_sample_1, perm_sample_2
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]

    return perm_sample_1, perm_sample_2

def diff_of_means(data_1, data_2):
    """Difference in means of two arrays."""

    # The difference of means of data_1, data_2: diff
    diff = np.mean(data_1)-np.mean(data_2)

    return diff

def draw_perm_reps(data_1, data_2, func, size=1):
    """Generate multiple permutation replicates."""

    # Initialize array of replicates: perm_replicates
    perm_replicates = np.empty(size)

    for i in range(size):
        # Generate permutation sample
        perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2)

        # Compute the test statistic
        perm_replicates[i] = func(perm_sample_1, perm_sample_2)

    return perm_replicates

# Compute the observed difference in mean ratings: empirical_diff_means
empirical_diff_means = diff_of_means(apple,google)

# Draw 10,000 permutation replicates: perm_replicates
perm_replicates = draw_perm_reps(apple,google,diff_of_means, size=10000)

# Compute the two-sided p-value: p
p = np.sum(np.abs(perm_replicates) >= abs(empirical_diff_means)) / len(perm_replicates)

# Print the result
print('p-value =', p)
p-value = 0.0
In [37]:
histo = plt.hist(perm_replicates)

Stage 4 - Evaluating and concluding

4a. What is our conclusion?

What do we know?

Recall: the p-value of our observed data is the probability, under the null hypothesis, of getting a difference at least as extreme as the one we actually observed.

As a result, we're going to count how many of the differences in our difference array are at least as extreme as our observed difference.

If 5% or fewer of them are, then we will reject the Null.
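
To make that count concrete, here's a minimal sketch using the difference array and the obs_difference value computed above (extreme_count and p_value are just illustrative names):

# Count how many permutation differences are at least as extreme as the observed difference
extreme_count = np.sum(np.abs(difference) >= obs_difference)
p_value = extreme_count / len(difference)
print(extreme_count, p_value)  # with this data we expect 0 and 0.0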


4b. What is our decision?

So actually, zero differences are at least as extreme as our observed difference!

So the p-value of our observed data is effectively 0 (less than 1 in 10,000).

It doesn't matter which significance level we pick; our observed result is statistically significant, and we reject the Null.

We conclude that platform does impact on ratings. Since the mean rating is higher on Google Play, we should advise our client to integrate only Google Play into their operating system interface.

4c. Other statistical tests, and next steps

The test we used here is the Permutation test. This was appropriate because our data were not normally distributed!

As we've seen in Professor Spiegelhalter's book, there are actually many different statistical tests, all with different assumptions. How many of these different statistical tests can you remember? How much do you remember about what the appropriate conditions are under which to use them?
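
As one concrete example you could try yourself, the Mann-Whitney U test is another non-parametric way to compare two independent samples. Here's a minimal sketch (not part of the original analysis) using the apple and google Series from above:

# Hypothetical follow-up: a Mann-Whitney U test on the same two samples
u_stat, u_p = stats.mannwhitneyu(apple, google, alternative='two-sided')
print('Mann-Whitney U p-value =', u_p)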

Make a note of your answers to these questions, and discuss them with your mentor at your next call.